Multiple Linear Regression

Ames Housing Data

Author
Affiliation

Nakul R. Padalkar

Boston University

Published

September 29, 2023

Modified

November 6, 2025

1 Setting up the environment

library(gdata)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(kableExtra)
library(tidyverse)
library(stringr)
library(lubridate)
library(scales)
library(graphics)
library(caret)

2 Introduction

2.1 Background

The Ames Housing dataset, an improvement over the well-known Boston Housing dataset, has become an essential tool for budding data scientists and researchers focusing on regression problems. Comprising 2930 observations and many explanatory variables ranging from the quality of streets to the number of fireplaces, it paints a detailed picture of the residential homes in Ames, Iowa.

2.2 Objective

This analysis aims to construct a predictive model that, using the various features of a home, accurately predicts its sale price. Such a model has practical applications for various stakeholders in the real estate market.

2.3 Importance

With a robust prediction model, potential homeowners can gauge whether a listing is overpriced or underpriced, sellers can set a competitive asking price, and real estate professionals can provide evidence-backed advice to clients.

2.4 Methodology Overview

Our approach is systematic:

  1. Data Exploration - Getting familiar with the data’s structure.
  2. Data Preprocessing - Preparing data for modeling.
  3. Feature Engineering - Optimizing representation of the predictors.
  4. Feature Selection - Using forward selection, backward elimination, and best subset selection.
  5. Model Selection & Training - Choosing an algorithm and training it.
  6. Model Evaluation - Validating the model’s accuracy.
  7. Prediction & Conclusion - Making predictions and concluding the analysis.

3 Data Exploration in Ames Housing Dataset

3.1 Loading the Data

This section focuses on loading the necessary libraries and the dataset for our exploration.

# Loading the required libraries
library(tidyverse)
library(corrplot)
# Loading the Ames Housing dataset
ames_data <- read.csv("../data/AmesHousing2.csv")
ggplot(data = ames_data, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(color='#336699') +
  labs(x = "Ground Living Area", y = "Sale Price") +
  ggtitle("Scatterplot of Sale Price vs. Gr.Liv.Area")

# Filtering out homes with above-grade living area over 4,000 sq ft
filtered_ames_data <- ames_data[ames_data$Gr.Liv.Area <= 4000, ]
ggplot(data = filtered_ames_data, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(color='#336699') +
  labs(x = "Ground Living Area", y = "Sale Price") +
  ggtitle("Scatterplot of Sale Price vs. Gr.Liv.Area")

3.2 Basic Overview

Before diving deeper into the data, it’s essential to get an overall sense of its structure, summary statistics, and a peek into the first few rows.

3.2.1 View the first few rows of the dataset

library(knitr)
library(kableExtra)

# Create a scrollable kable
kable(ames_data[1:10, ], "html") %>%
  kable_styling("striped", full_width = T, position = "center") 
Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice
117 80 9600 Reg AllPub Inside Gtl NWAmes 1Fam 2Story 6 6 1971 1971 CBlock 386 715 329 Y 930 715 1645 1 2 2 4 1 TA 7 0 Attchd 441 0 78 0 0 0 6 2010 Normal 171000
325 94 9400 Reg AllPub Corner Gtl Mitchel Duplex 2Story 6 5 1971 1971 CBlock 912 912 0 Y 912 912 1824 2 2 3 4 2 TA 8 0 NA 0 128 0 0 0 0 4 2010 Normal 139000
337 70 7700 Reg AllPub Inside Gtl Mitchel Duplex 2Story 5 2 1985 1986 PConc 1216 1216 0 Y 1216 1216 2432 4 2 5 4 2 TA 10 0 Attchd 616 200 0 0 0 0 2 2010 Normal 159000
393 60 9000 Reg AllPub FR2 Gtl NAmes Duplex 2Story 5 5 1974 1974 CBlock 896 896 0 Y 896 896 1792 2 2 3 4 2 TA 8 0 NA 0 32 45 0 0 0 9 2009 Normal 136000
590 NA 13774 IR1 AllPub Inside Gtl NWAmes 1Fam 2Story 7 7 1977 1992 PConc 476 908 432 Y 1316 972 2288 1 2 2 4 1 Gd 8 2 Attchd 520 321 72 0 0 156 11 2009 Normal 230000
667 81 9671 Reg AllPub Corner Gtl NAmes Duplex 2Story 6 5 1969 1969 CBlock 1248 1248 0 Y 1248 1296 2544 2 2 3 6 2 TA 12 0 Attchd 907 0 0 0 0 0 8 2009 Normal 190000
670 91 11643 Reg AllPub Inside Gtl NAmes Duplex 2Story 5 5 1969 1969 CBlock 748 1248 500 Y 1338 1296 2634 2 2 3 6 2 TA 12 0 Detchd 968 0 0 0 0 0 8 2009 Normal 200000
809 64 7018 Reg AllPub Inside Gtl SawyerW Duplex SFoyer 5 5 1979 1979 CBlock 0 1086 1086 Y 1224 0 1224 0 2 1 2 2 TA 6 2 Detchd 528 120 0 0 0 0 6 2009 Alloca 153337
814 64 7040 Reg AllPub Inside Gtl SawyerW Duplex SFoyer 5 5 1979 1979 CBlock 0 1094 1094 Y 1229 0 1229 0 2 1 2 2 Gd 6 2 Detchd 672 120 0 0 0 0 6 2009 Alloca 148325
816 NA 11855 Reg AllPub Inside Gtl SawyerW Duplex 2Story 7 5 2000 2000 PConc 348 1168 820 Y 1168 1619 2787 4 2 5 6 2 TA 8 2 BuiltIn 820 312 0 0 0 0 10 2009 Normal 269500

3.2.2 Display the structure of the dataset

# str(ames_data)
library(skimr)
library(gt)

ames_data |> 
  skimr::skim() |>
  gt::gt()
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character Lot.Shape 0 1.0000000 3 3 0 4 0 NA NA NA NA NA NA NA NA
character Utilities 0 1.0000000 6 6 0 3 0 NA NA NA NA NA NA NA NA
character Lot.Config 0 1.0000000 3 7 0 5 0 NA NA NA NA NA NA NA NA
character Land.Slope 0 1.0000000 3 3 0 3 0 NA NA NA NA NA NA NA NA
character Neighborhood 0 1.0000000 5 7 0 28 0 NA NA NA NA NA NA NA NA
character Bldg.Type 0 1.0000000 4 6 0 5 0 NA NA NA NA NA NA NA NA
character House.Style 0 1.0000000 4 6 0 8 0 NA NA NA NA NA NA NA NA
character Foundation 0 1.0000000 4 6 0 6 0 NA NA NA NA NA NA NA NA
character Central.Air 0 1.0000000 1 1 0 2 0 NA NA NA NA NA NA NA NA
character Kitchen.Qual 0 1.0000000 2 2 0 5 0 NA NA NA NA NA NA NA NA
character Garage.Type 157 0.9464164 6 7 0 6 0 NA NA NA NA NA NA NA NA
character Sale.Condition 0 1.0000000 6 7 0 6 0 NA NA NA NA NA NA NA NA
numeric Order 0 1.0000000 NA NA NA NA NA 1.465500e+03 8.459625e+02 1 733.25 1465.5 2197.75 2930 ▇▇▇▇▇
numeric Lot.Frontage 490 0.8327645 NA NA NA NA NA 6.922459e+01 2.336533e+01 21 58.00 68.0 80.00 313 ▇▃▁▁▁
numeric Lot.Area 0 1.0000000 NA NA NA NA NA 1.014792e+04 7.880018e+03 1300 7440.25 9436.5 11555.25 215245 ▇▁▁▁▁
numeric Overall.Qual 0 1.0000000 NA NA NA NA NA 6.094881e+00 1.411026e+00 1 5.00 6.0 7.00 10 ▁▂▇▅▁
numeric Overall.Cond 0 1.0000000 NA NA NA NA NA 5.563140e+00 1.111537e+00 1 5.00 5.0 6.00 9 ▁▁▇▅▁
numeric Year.Built 0 1.0000000 NA NA NA NA NA 1.971356e+03 3.024536e+01 1872 1954.00 1973.0 2001.00 2010 ▁▂▃▆▇
numeric Year.Remod.Add 0 1.0000000 NA NA NA NA NA 1.984267e+03 2.086029e+01 1950 1965.00 1993.0 2004.00 2010 ▅▂▂▃▇
numeric Bsmt.Unf.SF 1 0.9996587 NA NA NA NA NA 5.592625e+02 4.394942e+02 0 219.00 466.0 802.00 2336 ▇▅▂▁▁
numeric Total.Bsmt.SF 1 0.9996587 NA NA NA NA NA 1.051615e+03 4.406151e+02 0 793.00 990.0 1302.00 6110 ▇▃▁▁▁
numeric BaseLivArea 0 1.0000000 NA NA NA NA NA 4.921840e+02 4.773283e+02 0 0.00 459.5 808.00 5644 ▇▁▁▁▁
numeric X1st.Flr.SF 0 1.0000000 NA NA NA NA NA 1.159558e+03 3.918909e+02 334 876.25 1084.0 1384.00 5095 ▇▃▁▁▁
numeric X2nd.Flr.SF 0 1.0000000 NA NA NA NA NA 3.354560e+02 4.283957e+02 0 0.00 0.0 703.75 2065 ▇▃▂▁▁
numeric Gr.Liv.Area 0 1.0000000 NA NA NA NA NA 1.499690e+03 5.055089e+02 334 1126.00 1442.0 1742.75 5642 ▇▇▁▁▁
numeric Full.Bath 0 1.0000000 NA NA NA NA NA 1.566553e+00 5.529406e-01 0 1.00 2.0 2.00 4 ▁▇▇▁▁
numeric Half.Bath 0 1.0000000 NA NA NA NA NA 3.795222e-01 5.026293e-01 0 0.00 0.0 1.00 2 ▇▁▅▁▁
numeric Bathrooms 0 1.0000000 NA NA NA NA NA 1.756314e+00 6.428715e-01 0 1.00 2.0 2.50 5 ▆▇▅▁▁
numeric Bedroom.AbvGr 0 1.0000000 NA NA NA NA NA 2.854266e+00 8.277311e-01 0 2.00 3.0 3.00 8 ▁▇▂▁▁
numeric Kitchen.AbvGr 0 1.0000000 NA NA NA NA NA 1.044369e+00 2.140762e-01 0 1.00 1.0 1.00 3 ▁▇▁▁▁
numeric TotRms.AbvGrd 0 1.0000000 NA NA NA NA NA 6.443003e+00 1.572964e+00 2 5.00 6.0 7.00 15 ▁▇▂▁▁
numeric Fireplaces 0 1.0000000 NA NA NA NA NA 5.993174e-01 6.479209e-01 0 0.00 1.0 1.00 4 ▇▇▁▁▁
numeric Garage.Area 1 0.9996587 NA NA NA NA NA 4.728197e+02 2.150465e+02 0 320.00 480.0 576.00 1488 ▃▇▃▁▁
numeric Wood.Deck.SF 0 1.0000000 NA NA NA NA NA 9.375188e+01 1.263616e+02 0 0.00 0.0 168.00 1424 ▇▁▁▁▁
numeric Open.Porch.SF 0 1.0000000 NA NA NA NA NA 4.753345e+01 6.748340e+01 0 0.00 27.0 70.00 742 ▇▁▁▁▁
numeric Enclosed.Porch 0 1.0000000 NA NA NA NA NA 2.301160e+01 6.413906e+01 0 0.00 0.0 0.00 1012 ▇▁▁▁▁
numeric X3Ssn.Porch 0 1.0000000 NA NA NA NA NA 2.592491e+00 2.514133e+01 0 0.00 0.0 0.00 508 ▇▁▁▁▁
numeric Screen.Porch 0 1.0000000 NA NA NA NA NA 1.600205e+01 5.608737e+01 0 0.00 0.0 0.00 576 ▇▁▁▁▁
numeric Mo.Sold 0 1.0000000 NA NA NA NA NA 6.216041e+00 2.714492e+00 1 4.00 6.0 8.00 12 ▅▆▇▃▃
numeric Yr.Sold 0 1.0000000 NA NA NA NA NA 2.007790e+03 1.316613e+00 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▃
numeric SalePrice 0 1.0000000 NA NA NA NA NA 1.807961e+05 7.988669e+04 12789 129500.00 160000.0 213500.00 755000 ▇▇▁▁▁

3.2.3 Display summary statistics for the dataset

# Install and load the required library if not already done
library(gt)

summary(ames_data)
#>      Order         Lot.Frontage       Lot.Area       Lot.Shape        
#>  Min.   :   1.0   Min.   : 21.00   Min.   :  1300   Length:2930       
#>  1st Qu.: 733.2   1st Qu.: 58.00   1st Qu.:  7440   Class :character  
#>  Median :1465.5   Median : 68.00   Median :  9436   Mode  :character  
#>  Mean   :1465.5   Mean   : 69.22   Mean   : 10148                     
#>  3rd Qu.:2197.8   3rd Qu.: 80.00   3rd Qu.: 11555                     
#>  Max.   :2930.0   Max.   :313.00   Max.   :215245                     
#>                   NA's   :490                                         
#>   Utilities          Lot.Config         Land.Slope        Neighborhood      
#>  Length:2930        Length:2930        Length:2930        Length:2930       
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>                                                                             
#>   Bldg.Type         House.Style         Overall.Qual     Overall.Cond  
#>  Length:2930        Length:2930        Min.   : 1.000   Min.   :1.000  
#>  Class :character   Class :character   1st Qu.: 5.000   1st Qu.:5.000  
#>  Mode  :character   Mode  :character   Median : 6.000   Median :5.000  
#>                                        Mean   : 6.095   Mean   :5.563  
#>                                        3rd Qu.: 7.000   3rd Qu.:6.000  
#>                                        Max.   :10.000   Max.   :9.000  
#>                                                                        
#>    Year.Built   Year.Remod.Add  Foundation         Bsmt.Unf.SF    
#>  Min.   :1872   Min.   :1950   Length:2930        Min.   :   0.0  
#>  1st Qu.:1954   1st Qu.:1965   Class :character   1st Qu.: 219.0  
#>  Median :1973   Median :1993   Mode  :character   Median : 466.0  
#>  Mean   :1971   Mean   :1984                      Mean   : 559.3  
#>  3rd Qu.:2001   3rd Qu.:2004                      3rd Qu.: 802.0  
#>  Max.   :2010   Max.   :2010                      Max.   :2336.0  
#>                                                   NA's   :1       
#>  Total.Bsmt.SF   BaseLivArea     Central.Air         X1st.Flr.SF    
#>  Min.   :   0   Min.   :   0.0   Length:2930        Min.   : 334.0  
#>  1st Qu.: 793   1st Qu.:   0.0   Class :character   1st Qu.: 876.2  
#>  Median : 990   Median : 459.5   Mode  :character   Median :1084.0  
#>  Mean   :1052   Mean   : 492.2                      Mean   :1159.6  
#>  3rd Qu.:1302   3rd Qu.: 808.0                      3rd Qu.:1384.0  
#>  Max.   :6110   Max.   :5644.0                      Max.   :5095.0  
#>  NA's   :1                                                          
#>   X2nd.Flr.SF      Gr.Liv.Area     Full.Bath       Half.Bath     
#>  Min.   :   0.0   Min.   : 334   Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:   0.0   1st Qu.:1126   1st Qu.:1.000   1st Qu.:0.0000  
#>  Median :   0.0   Median :1442   Median :2.000   Median :0.0000  
#>  Mean   : 335.5   Mean   :1500   Mean   :1.567   Mean   :0.3795  
#>  3rd Qu.: 703.8   3rd Qu.:1743   3rd Qu.:2.000   3rd Qu.:1.0000  
#>  Max.   :2065.0   Max.   :5642   Max.   :4.000   Max.   :2.0000  
#>                                                                  
#>    Bathrooms     Bedroom.AbvGr   Kitchen.AbvGr   Kitchen.Qual      
#>  Min.   :0.000   Min.   :0.000   Min.   :0.000   Length:2930       
#>  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   Class :character  
#>  Median :2.000   Median :3.000   Median :1.000   Mode  :character  
#>  Mean   :1.756   Mean   :2.854   Mean   :1.044                     
#>  3rd Qu.:2.500   3rd Qu.:3.000   3rd Qu.:1.000                     
#>  Max.   :5.000   Max.   :8.000   Max.   :3.000                     
#>                                                                    
#>  TotRms.AbvGrd      Fireplaces     Garage.Type         Garage.Area    
#>  Min.   : 2.000   Min.   :0.0000   Length:2930        Min.   :   0.0  
#>  1st Qu.: 5.000   1st Qu.:0.0000   Class :character   1st Qu.: 320.0  
#>  Median : 6.000   Median :1.0000   Mode  :character   Median : 480.0  
#>  Mean   : 6.443   Mean   :0.5993                      Mean   : 472.8  
#>  3rd Qu.: 7.000   3rd Qu.:1.0000                      3rd Qu.: 576.0  
#>  Max.   :15.000   Max.   :4.0000                      Max.   :1488.0  
#>                                                       NA's   :1       
#>   Wood.Deck.SF     Open.Porch.SF    Enclosed.Porch     X3Ssn.Porch     
#>  Min.   :   0.00   Min.   :  0.00   Min.   :   0.00   Min.   :  0.000  
#>  1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:  0.000  
#>  Median :   0.00   Median : 27.00   Median :   0.00   Median :  0.000  
#>  Mean   :  93.75   Mean   : 47.53   Mean   :  23.01   Mean   :  2.592  
#>  3rd Qu.: 168.00   3rd Qu.: 70.00   3rd Qu.:   0.00   3rd Qu.:  0.000  
#>  Max.   :1424.00   Max.   :742.00   Max.   :1012.00   Max.   :508.000  
#>                                                                        
#>   Screen.Porch    Mo.Sold          Yr.Sold     Sale.Condition    
#>  Min.   :  0   Min.   : 1.000   Min.   :2006   Length:2930       
#>  1st Qu.:  0   1st Qu.: 4.000   1st Qu.:2007   Class :character  
#>  Median :  0   Median : 6.000   Median :2008   Mode  :character  
#>  Mean   : 16   Mean   : 6.216   Mean   :2008                     
#>  3rd Qu.:  0   3rd Qu.: 8.000   3rd Qu.:2009                     
#>  Max.   :576   Max.   :12.000   Max.   :2010                     
#>                                                                  
#>    SalePrice     
#>  Min.   : 12789  
#>  1st Qu.:129500  
#>  Median :160000  
#>  Mean   :180796  
#>  3rd Qu.:213500  
#>  Max.   :755000  
#> 

3.3 Histogram of SalePrice

To understand the distribution of our target variable, SalePrice, we’ll visualize it using a histogram.

# Plotting a histogram for SalePrice
ggplot(data = ames_data, aes(x = SalePrice)) +
  geom_histogram(fill = '#336699', color = "black", bins = 50) +
  labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Frequency") +
  theme_minimal()

3.4 Correlation Analysis

3.4.1 Correlation with respect to SalePrice

First, let’s identify how the continuous variables in our dataset correlate with SalePrice.

# Calculating correlations of continuous variables with SalePrice
continuous_vars <- ames_data %>% select_if(is.numeric)
correlations <- cor(continuous_vars)
saleprice_correlations <- correlations['SalePrice',]
saleprice_correlations_df <- data.frame(Variable = names(saleprice_correlations), 
                                        Correlation = saleprice_correlations) %>%
  arrange(desc(Correlation))

# Plotting these correlations
ggplot(data = saleprice_correlations_df, aes(x = reorder(Variable, Correlation), y = Correlation)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Correlation of Continuous Variables with SalePrice", x = "", y = "Correlation") +
  theme_minimal()
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).

3.4.2 Correlation among Continuous Variables

Understanding how continuous variables correlate with each other can provide insights into potential multicollinearity and relationships between variables.

# Remove columns with zero variance or too many NAs
filtered_continuous_vars <- continuous_vars %>% 
  select_if(function(x) var(x, na.rm = TRUE) > 0 & mean(is.na(x)) < 0.5)

# Recalculate correlations
fixed_correlations <- cor(filtered_continuous_vars, use = "complete.obs")

# Replace NA with 0 in correlation matrix
fixed_correlations[is.na(fixed_correlations)] <- 0

# Plotting correlations among continuous variables
corrplot(fixed_correlations, method = "color", order = "hclust",
         addCoef.col = "black", tl.cex = 0.75, tl.srt = 45,
         type = "lower", number.cex = 0.60)
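To make the correlation plot easier to act on, we can also list the variable pairs whose correlation magnitude exceeds a chosen cutoff. This is a sketch; the 0.8 threshold is an illustrative convention, not a rule:

```r
# Extract variable pairs with |r| > 0.8 from the correlation matrix,
# using the upper triangle so each pair appears only once
high_corr <- which(abs(fixed_correlations) > 0.8 &
                     upper.tri(fixed_correlations), arr.ind = TRUE)
data.frame(var1 = rownames(fixed_correlations)[high_corr[, 1]],
           var2 = colnames(fixed_correlations)[high_corr[, 2]],
           r    = fixed_correlations[high_corr])
```

Pairs surfaced this way (for example, the floor-area and living-area variables, which are related by construction) are candidates for closer multicollinearity scrutiny during modeling.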

4 Data Preprocessing

Data preprocessing in the context of the Ames Housing dataset is a crucial phase where raw data is cleaned and transformed to make it suitable for modeling. It entails addressing missing values, which might indicate absent housing features, ensuring numerical attributes are on a consistent scale through feature scaling, and converting categorical attributes into a machine-readable format using encoding techniques. Additionally, the dataset might benefit from feature engineering, where new attributes are derived or existing ones are modified to better capture underlying patterns in the housing market. Lastly, a train-test split is performed, partitioning the dataset to ensure models can be trained on one subset and validated on another, allowing for an assessment of their real-world performance.

  1. Handling Missing Values: In the Ames Housing dataset, missing values might indicate the absence of a feature in a house (like a garage). We can impute these missing values using statistical methods, like replacing with the median for numerical features or the mode for categorical ones, ensuring our models have complete data to learn from.
  2. Feature Scaling: In datasets like Ames Housing, with numerous numerical features like area or age, feature scaling ensures all numerical attributes have the same scale. This is essential for algorithms like gradient descent or k-NN, where larger-scale features can disproportionately impact the model.
  3. Encoding Categorical Variables: The Ames dataset contains categorical variables, like neighborhood or house style. Encoding transforms these text-based categories into numerical format, often using techniques like one-hot encoding. This conversion ensures machine learning algorithms, which require numerical input, can process this data.
  4. Feature Engineering: Feature engineering involves creating new features from the existing ones in the Ames dataset or modifying current attributes to represent the underlying patterns better. For instance, combining ‘1st Floor Area’ and ‘2nd Floor Area’ to get ‘Total Floor Area’ might give a model a more consolidated view of a property’s size.
  5. Train-Test Split: To assess the performance of our models on unseen data, we divide the Ames dataset into a training set (to train the model) and a testing set (to test its predictions). A standard split might allocate 70% of the data for training and the remaining 30% for testing, ensuring our models are well-trained and well-validated.

4.1 Handling Missing Values

# List columns with missing values
missing_cols <- colnames(ames_data)[colSums(is.na(ames_data)) > 0]

For simplicity, we’ll impute missing values with the median for numeric columns and the mode for categorical columns:

for (col in missing_cols) {
  if (is.numeric(ames_data[[col]])) {
    ames_data[[col]][is.na(ames_data[[col]])] <- median(ames_data[[col]], na.rm = TRUE)
  } else {
    mode_val <- as.character(names(sort(table(ames_data[[col]]), decreasing = TRUE)[1]))
    ames_data[[col]][is.na(ames_data[[col]])] <- mode_val
  }
}
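As a quick sanity check, we can confirm that the imputation loop left no missing cells:

```r
# Count remaining missing values across the entire data frame;
# after imputing every column in missing_cols this should be zero
sum(is.na(ames_data))
```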

4.2 Feature Scaling - Normalize numeric features

# Not this time
# numeric_cols <- sapply(ames_data, is.numeric)
# ames_data[numeric_cols] <- lapply(ames_data[numeric_cols], scale)

4.3 Encoding Categorical Variables

# Not Needed for Now
# ames_data <- model.matrix(~ . - 1, data = ames_data)

4.4 Feature Engineering

We won’t perform feature engineering here for simplicity, but this is where you’d create or modify features based on domain knowledge or insights from EDA.
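For illustration only, the ‘Total Floor Area’ idea from Section 4 could be sketched as below; TotalFlrArea is a hypothetical column name, and this code is not run in the present analysis:

```r
# Hypothetical example (not part of this analysis): consolidate the two
# floor-area columns into a single predictor of overall property size
ames_data <- ames_data %>%
  mutate(TotalFlrArea = X1st.Flr.SF + X2nd.Flr.SF)
```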

5 Train-Test Split

set.seed(123) # For reproducibility
sample_index <- sample(seq_len(nrow(ames_data)), size = 0.7 * nrow(ames_data))
train_data <- ames_data[sample_index, ]
test_data <- ames_data[-sample_index, ]

6 Feature Selection

6.1 Simple Linear Regression Model

Fitting a simple linear model between SalePrice and Gr.Liv.Area can be achieved using R’s lm() function.

Here’s how you can create and summarize such a model:

options(scipen=999)
# Fit the linear model on the training data
lm_model <- lm(SalePrice ~ Gr.Liv.Area, data = train_data)

# Summarize the model
# stargazer::stargazer(lm_model, type = "html", title = "Simple Linear Model", ci=TRUE, single.row = FALSE, no.space = FALSE, align = TRUE, digits=5, font.size = "small",  report = "vc*stp")
summary(lm_model)
#> 
#> Call:
#> lm(formula = SalePrice ~ Gr.Liv.Area, data = train_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -477601  -28912   -2320   21280  335224 
#> 
#> Coefficients:
#>              Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 15969.885   3852.872   4.145            0.0000354 ***
#> Gr.Liv.Area   110.179      2.438  45.199 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 55870 on 2049 degrees of freedom
#> Multiple R-squared:  0.4993, Adjusted R-squared:  0.499 
#> F-statistic:  2043 on 1 and 2049 DF,  p-value: < 0.00000000000000022

The model above tells you that for each one-unit (square foot) increase in Gr.Liv.Area, the predicted SalePrice increases by about 110.18 dollars; the intercept of 15,969.89 is the predicted price when Gr.Liv.Area is zero. The fitted equation is SalePrice = 15969.89 + 110.18 \times Gr.Liv.Area.
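The fitted model can then generate predictions for new homes; the 1,500 sq ft input below is purely illustrative:

```r
# Predict the sale price of a hypothetical 1,500 sq ft home,
# with a 95% prediction interval for an individual sale
predict(lm_model,
        newdata = data.frame(Gr.Liv.Area = 1500),
        interval = "prediction")
```

Plugging 1,500 into the fitted equation gives a point estimate of roughly $181,000, with a wide prediction interval reflecting the model’s residual standard error of about $55,870.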

6.2 Multiple Linear Regression Model

Let’s run a multiple linear regression model using all available predictors in the train_data to predict SalePrice.

Here’s how you can create and summarize such a model:

# Fit the multiple linear regression model on the training data using all predictors
full_lm_model <- lm(SalePrice ~ ., data = train_data)

# Summarize the model
stargazer::stargazer(full_lm_model, type = "html", title = "Full Linear Model", ci=TRUE, single.row = T, no.space = FALSE, align = TRUE, digits=5, font.size = "small",  report = "vc*stp")
Full Linear Model
Dependent variable:
SalePrice
Order 4.61917 (-12.05823, 21.29657)
t = 0.54286
p = 0.58730
Lot.Frontage -95.08760** (-177.84030, -12.33486)
t = -2.25211
p = 0.02443
Lot.Area 0.53936*** (0.34036, 0.73837)
t = 5.31221
p = 0.0000002
Lot.ShapeIR2 9,703.43300** (1,053.18600, 18,353.68000)
t = 2.19859
p = 0.02803
Lot.ShapeIR3 -27,454.21000*** (-43,011.15000, -11,897.27000)
t = -3.45886
p = 0.00056
Lot.ShapeReg 1,889.03000 (-1,219.37300, 4,997.43400)
t = 1.19110
p = 0.23376
UtilitiesNoSewr -4,265.43600 (-44,714.14000, 36,183.27000)
t = -0.20668
p = 0.83628
Lot.ConfigCulDSac 8,537.96900** (2,024.38900, 15,051.55000)
t = 2.56911
p = 0.01027
Lot.ConfigFR2 -4,781.72300 (-12,971.42000, 3,407.97300)
t = -1.14437
p = 0.25262
Lot.ConfigFR3 2,133.61500 (-15,215.00000, 19,482.23000)
t = 0.24105
p = 0.80955
Lot.ConfigInside 1,494.17000 (-2,029.32400, 5,017.66400)
t = 0.83114
p = 0.40600
Land.SlopeMod 11,062.63000*** (4,526.68100, 17,598.58000)
t = 3.31740
p = 0.00093
Land.SlopeSev -8,812.87300 (-26,974.91000, 9,349.16300)
t = -0.95104
p = 0.34170
NeighborhoodBlueste -710.49220 (-27,308.48000, 25,887.49000)
t = -0.05236
p = 0.95826
NeighborhoodBrDale 3,728.84400 (-16,239.42000, 23,697.11000)
t = 0.36600
p = 0.71441
NeighborhoodBrkSide -13,863.09000 (-31,410.40000, 3,684.21300)
t = -1.54845
p = 0.12168
NeighborhoodClearCr -14,493.15000 (-33,763.86000, 4,777.56000)
t = -1.47405
p = 0.14063
NeighborhoodCollgCr -13,063.47000 (-29,258.25000, 3,131.30000)
t = -1.58100
p = 0.11404
NeighborhoodCrawfor 6,462.82800 (-11,264.91000, 24,190.57000)
t = 0.71452
p = 0.47499
NeighborhoodEdwards -27,054.07000*** (-43,879.12000, -10,229.03000)
t = -3.15155
p = 0.00165
NeighborhoodGilbert -18,456.61000** (-34,129.62000, -2,783.61200)
t = -2.30806
p = 0.02110
NeighborhoodGreens 16,971.38000 (-9,295.00100, 43,237.76000)
t = 1.26638
p = 0.20553
NeighborhoodGrnHill 133,832.70000*** (75,826.81000, 191,838.50000)
t = 4.52208
p = 0.00001
NeighborhoodIDOTRR -20,440.44000** (-39,059.21000, -1,821.67300)
t = -2.15173
p = 0.03155
NeighborhoodLandmrk 6,702.62500 (-51,973.81000, 65,379.06000)
t = 0.22389
p = 0.82287
NeighborhoodMeadowV -3,716.85500 (-23,404.00000, 15,970.29000)
t = -0.37003
p = 0.71140
NeighborhoodMitchel -18,937.37000** (-36,763.47000, -1,111.27000)
t = -2.08215
p = 0.03746
NeighborhoodNAmes -16,121.85000** (-31,760.05000, -483.64530)
t = -2.02058
p = 0.04346
NeighborhoodNoRidge 40,872.70000*** (23,658.26000, 58,087.14000)
t = 4.65359
p = 0.000004
NeighborhoodNPkVill 2,299.06900 (-19,268.72000, 23,866.86000)
t = 0.20893
p = 0.83453
NeighborhoodNridgHt 36,664.38000*** (21,663.44000, 51,665.32000)
t = 4.79042
p = 0.000002
NeighborhoodNWAmes -18,966.09000** (-34,931.51000, -3,000.68100)
t = -2.32834
p = 0.02000
NeighborhoodOldTown -20,682.09000** (-37,682.38000, -3,681.79900)
t = -2.38444
p = 0.01721
NeighborhoodSawyer -17,915.49000** (-34,431.13000, -1,399.86000)
t = -2.12609
p = 0.03363
NeighborhoodSawyerW -17,652.24000** (-33,555.59000, -1,748.87900)
t = -2.17550
p = 0.02972
NeighborhoodSomerst 8,538.23000 (-6,296.94600, 23,373.41000)
t = 1.12804
p = 0.25945
NeighborhoodStoneBr 48,478.99000*** (31,964.51000, 64,993.48000)
t = 5.75356
p = 0.00000
NeighborhoodSWISU -19,674.16000** (-39,302.01000, -46.30389)
t = -1.96459
p = 0.04961
NeighborhoodTimber -2,105.89400 (-20,421.87000, 16,210.09000)
t = -0.22535
p = 0.82174
NeighborhoodVeenker 2,179.58100 (-17,309.48000, 21,668.64000)
t = 0.21919
p = 0.82653
Bldg.Type2fmCon -6,896.65700 (-17,014.84000, 3,221.52400)
t = -1.33593
p = 0.18173
Bldg.TypeDuplex -11.75693 (-12,266.83000, 12,243.32000)
t = -0.00188
p = 0.99850
Bldg.TypeTwnhs -41,073.04000*** (-51,866.13000, -30,279.95000)
t = -7.45863
p = 0.00000
Bldg.TypeTwnhsE -29,786.06000*** (-36,684.70000, -22,887.42000)
t = -8.46248
p = 0.00000
House.Style1.5Unf 6,206.55100 (-10,207.87000, 22,620.97000)
t = 0.74109
p = 0.45873
House.Style1Story 8,663.58800** (1,493.23100, 15,833.94000)
t = 2.36813
p = 0.01798
House.Style2.5Fin -16,788.65000 (-44,887.53000, 11,310.23000)
t = -1.17105
p = 0.24173
House.Style2.5Unf -571.31310 (-16,207.44000, 15,064.81000)
t = -0.07161
p = 0.94292
House.Style2Story -7,381.70000** (-13,470.43000, -1,292.97100)
t = -2.37617
p = 0.01759
House.StyleSFoyer 8,663.96200 (-1,846.15800, 19,174.08000)
t = 1.61569
p = 0.10633
House.StyleSLvl 1,382.75600 (-7,339.72100, 10,105.23000)
t = 0.31071
p = 0.75606
Overall.Qual 11,535.13000*** (9,748.80900, 13,321.45000)
t = 12.65643
p = 0.00000
Overall.Cond 5,122.06300*** (3,640.09500, 6,604.03200)
t = 6.77414
p = 0.00000
Year.Built 364.55260*** (236.31640, 492.78870)
t = 5.57183
p = 0.0000001
Year.Remod.Add 83.40910 (-17.11190, 183.93010)
t = 1.62632
p = 0.10405
FoundationCBlock -1,416.79500 (-7,247.69000, 4,414.10000)
t = -0.47623
p = 0.63397
FoundationPConc 4,860.30800 (-1,620.57400, 11,341.19000)
t = 1.46987
p = 0.14176
FoundationSlab -2,454.59600 (-16,302.93000, 11,393.74000)
t = -0.34740
p = 0.72833
FoundationStone 9,785.99200 (-9,763.33900, 29,335.32000)
t = 0.98112
p = 0.32666
FoundationWood -13,077.21000 (-43,087.81000, 16,933.39000)
t = -0.85406
p = 0.39318
Bsmt.Unf.SF 20.08809 (-92.88957, 133.06580)
t = 0.34849
p = 0.72751
Total.Bsmt.SF -14.90117 (-127.87620, 98.07386)
t = -0.25852
p = 0.79604
BaseLivArea 37.94264 (-74.94773, 150.83300)
t = 0.65875
p = 0.51014
Central.AirY -2,331.77700 (-8,722.74500, 4,059.19000)
t = -0.71510
p = 0.47464
X1st.Flr.SF 11.16909 (-21.56201, 43.90020)
t = 0.66881
p = 0.50370
X2nd.Flr.SF 21.20470 (-10.27377, 52.68317)
t = 1.32028
p = 0.18690
Gr.Liv.Area 27.46222* (-4.73022, 59.65466)
t = 1.67198
p = 0.09469
Full.Bath 6,683.69600*** (2,746.96700, 10,620.43000)
t = 3.32759
p = 0.00090
Half.Bath 2,363.51700 (-1,456.97200, 6,184.00600)
t = 1.21252
p = 0.22547
Bathrooms
Bedroom.AbvGr -2,775.97400** (-5,249.07300, -302.87490)
t = -2.20000
p = 0.02793
Kitchen.AbvGr -13,051.86000** (-24,242.05000, -1,861.68000)
t = -2.28604
p = 0.02236
Kitchen.QualFa -35,430.56000*** (-46,768.31000, -24,092.82000)
t = -6.12490
p = 0.00000
Kitchen.QualGd -37,934.88000*** (-44,089.20000, -31,780.56000)
t = -12.08111
p = 0.00000
Kitchen.QualTA -39,483.22000*** (-46,580.57000, -32,385.87000)
t = -10.90346
p = 0.00000
TotRms.AbvGrd 1,553.69400* (-174.07630, 3,281.46500)
t = 1.76249
p = 0.07815
Fireplaces 5,243.89500*** (2,730.50000, 7,757.29000)
t = 4.08923
p = 0.00005
Garage.TypeAttchd 10,760.05000 (-4,364.42100, 25,884.51000)
t = 1.39438
p = 0.16337
Garage.TypeBasment 15,279.79000 (-4,558.76000, 35,118.34000)
t = 1.50958
p = 0.13132
Garage.TypeBuiltIn 15,600.90000* (-697.97820, 31,899.77000)
t = 1.87603
p = 0.06080
Garage.TypeCarPort 5,377.45300 (-18,086.58000, 28,841.49000)
t = 0.44918
p = 0.65336
Garage.TypeDetchd 8,604.57100 (-6,408.82300, 23,617.96000)
t = 1.12331
p = 0.26145
Garage.Area 25.00584*** (16.00345, 34.00823)
t = 5.44417
p = 0.0000001
Wood.Deck.SF 4.67919 (-6.46395, 15.82234)
t = 0.82302
p = 0.41060
Open.Porch.SF -4.84744 (-26.26057, 16.56569)
t = -0.44369
p = 0.65732
Enclosed.Porch 11.40326 (-9.74494, 32.55147)
t = 1.05683
p = 0.29073
X3Ssn.Porch 40.87324 (-17.08306, 98.82953)
t = 1.38225
p = 0.16706
Screen.Porch 60.57027*** (37.69270, 83.44783)
t = 5.18917
p = 0.0000003
Mo.Sold -340.01420 (-809.94220, 129.91380)
t = -1.41812
p = 0.15632
Yr.Sold 2,352.76300 (-8,161.50600, 12,867.03000)
t = 0.43858
p = 0.66102
Sale.ConditionAdjLand 3,075.75500 (-19,059.80000, 25,211.31000)
t = 0.27234
p = 0.78540
Sale.ConditionAlloca 3,523.68300 (-11,356.82000, 18,404.18000)
t = 0.46412
p = 0.64262
Sale.ConditionFamily 6,487.67400 (-5,044.99900, 18,020.35000)
t = 1.10257
p = 0.27035
Sale.ConditionNormal 7,880.08300*** (2,538.89900, 13,221.27000)
t = 2.89162
p = 0.00388
Sale.ConditionPartial 21,144.76000*** (13,743.88000, 28,545.65000)
t = 5.59973
p = 0.0000001
Constant -5,597,142.00000 (-26,756,346.00000, 15,562,063.00000)
t = -0.51846
p = 0.60420
Observations 2,051
R2 0.87716
Adjusted R2 0.87126
Residual Std. Error 28,321.30000 (df = 1956)
F Statistic 148.59050*** (df = 94; 1956)
Note: p<0.1; p<0.05; p<0.01

6.2.1 Regression Diagnostic

Once a linear regression model is fitted, there are four primary assumptions that we should check:

  1. Linearity: The relationship between the predictors and the response variable should be linear.
  2. Independence: The residuals should be independent, especially in time series data.
  3. Homoscedasticity (Constant Variance): The variance of the residuals should be constant across all levels of the independent variables.
  4. Normality: The residuals should be approximately normally distributed.

In R, we commonly use diagnostic plots to diagnose potential violations of these assumptions. Here’s a breakdown of these plots and their descriptions:

6.2.1.1 Residual vs. Fitted (or Predicted) Values:

This plot helps check the assumptions of linearity and equal variance (homoscedasticity).

Description: A horizontal band around the 0 reference line indicates that the relationship is linear and the variances are equal. Patterns or trends suggest violations.

plot(full_lm_model, which=1)

6.2.1.2 Normal Q-Q Plot:

This plot checks the assumption of normality of the residuals.

Description: If the residuals are normally distributed, they should fall on a roughly straight line at a 45-degree angle. Deviations from this line indicate departures from normality.

plot(full_lm_model, which=2)
#> Warning: not plotting observations with leverage one:
#>   94, 541, 1048

6.2.1.3 Scale-Location (or Spread-Location) Plot:

This plot helps check the assumption of equal variance (homoscedasticity).

Description: A horizontal band with equally spread points indicates equal variances. A funnel shape (either narrow at the bottom or top) suggests that the variances change with the fitted values, violating the assumption.

plot(full_lm_model, which=3)
#> Warning: not plotting observations with leverage one:
#>   94, 541, 1048

6.2.1.4 Residuals vs. Leverage:

This plot helps identify influential cases: observations that exert undue influence on the regression equation.

Description: Points that stand out, especially those in the plot’s top-right or bottom-right corners, exert strong influence on the regression equation. The Cook’s distance contour lines help flag influential observations.

plot(full_lm_model, which=5)
#> Warning: not plotting observations with leverage one:
#>   94, 541, 1048
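All four standard diagnostic plots can also be drawn in a single 2x2 panel. A minimal sketch, using mtcars as a stand-in dataset; with the model from this analysis you would pass full_lm_model instead of fit:

```r
# All four standard lm() diagnostics in one 2x2 panel.
# mtcars is a stand-in dataset; in this analysis you would call
# plot(full_lm_model) instead.
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))   # 2x2 plotting grid
plot(fit)              # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))   # restore the default single-panel layout
```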

Remember, no real-world data will perfectly meet all these assumptions. The key is to identify major deviations that could bias the regression results. If any of these assumptions appear to be violated, further investigation and potentially other modeling approaches or transformations might be necessary.
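For instance, if the Scale-Location plot shows residual spread growing with the fitted values, a variance-stabilizing transform of the response is a common first remedy. A minimal sketch, again with mtcars standing in for the Ames data (where the response would be SalePrice):

```r
# Refit with a log-transformed response when residual spread
# grows with the mean; mtcars stands in for the Ames data.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Re-inspect the Scale-Location plot of the transformed fit
plot(fit_log, which = 3)
```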

6.2.1.5 Cook’s Distance

Cook’s Distance is a measure used in regression analysis to identify influential data points. An influential data point is an observation or set of observations that notably affects the regression equation. Such points can exert undue leverage on the estimated regression coefficients, potentially skewing the model.

Concept: Cook’s Distance combines the leverage (how extreme the input data is) and the residuals (how extreme the output is) into a single measure.

The measure quantifies how much all the fitted values would change if we excluded a particular observation from the dataset.

Calculation: It is computed for each observation and represents the scaled change in the fitted values when the observation is left out of the regression. Mathematically, it’s a function of each observation’s residual and leverage.

Interpretation: A common rule of thumb is that any observation with a Cook’s Distance greater than 1 may be influential. In practice, however, this threshold is rarely exceeded and can be too lenient, especially for larger datasets.

Another common practice is to look for points whose Cook’s Distance exceeds four times the mean of all Cook’s Distance values.

A large Cook’s Distance value indicates that the observation strongly influences the regression coefficients. Removing this observation would significantly change the estimated regression line.

It’s crucial not to automatically exclude data points based on Cook’s Distance alone. If an observation has a high Cook’s Distance, one should investigate the reasons for its influence. It could result from data entry or measurement errors or represent a legitimate, interesting outlier.

plot(full_lm_model, which=4)
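Beyond the plot, the values themselves are available through cooks.distance(), which makes the four-times-the-mean rule easy to apply. A sketch with mtcars standing in for the Ames model:

```r
# Compute Cook's distance for each observation and flag points
# exceeding four times the mean value (a common rule of thumb).
# mtcars stands in for the Ames data here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
cd  <- cooks.distance(fit)
influential <- which(cd > 4 * mean(cd))
influential   # indices of potentially influential observations
```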

6.3 Variable Selection Techniques in Multiple Regression

6.3.1 Forward Selection:

  • Starting Point: Begins with no predictors in the model.
  • Procedure: At each step, the predictor whose addition most improves the model (e.g., gives the lowest residual sum of squares, or the lowest AIC when using R’s step()) is included.
  • Stopping Point: Continues until a pre-specified stopping rule is met, like a certain number of variables, or until adding predictors no longer improves the model by a significant amount.
  • Advantages: Computationally efficient compared to best subset selection, especially when the number of predictors is large.
# Load necessary library
library(MASS)

# Fit the base model (with only the intercept)
base_model <- lm(SalePrice ~ 1, data=train_data)

# Perform forward selection
forward_selected_model <- step(base_model, 
                               scope=list(lower=~1, 
                                          upper=~Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities + 
                                                  Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea + 
                                                  Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath + Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr + TotRms.AbvGrd + Fireplaces + Garage.Type + 
                                                  Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch + 
                                                  X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition),
                               direction="forward",
                               trace=0)  # trace=1 provides a step-by-step output

# View the final model
stargazer::stargazer(forward_selected_model, type = "html", out = "forward_regression.html", title = "Forward Selection Model", ci = TRUE, single.row = TRUE, no.space = FALSE, align = TRUE, digits = 5, font.size = "small", report = "vc*stp")
Forward Selection Model
Dependent variable:
SalePrice
Overall.Qual 13,479.37000*** (11,692.66000, 15,266.07000)
t = 14.78649
p = 0.00000
Gr.Liv.Area 41.52258*** (33.27860, 49.76656)
t = 9.87178
p = 0.00000
NeighborhoodBlueste 2,843.10700 (-23,729.50000, 29,415.71000)
t = 0.20970
p = 0.83393
NeighborhoodBrDale 8,739.89200 (-11,680.53000, 29,160.31000)
t = 0.83886
p = 0.40165
NeighborhoodBrkSide -11,472.78000 (-29,118.76000, 6,173.19300)
t = -1.27430
p = 0.20271
NeighborhoodClearCr -13,620.00000 (-32,801.76000, 5,561.75800)
t = -1.39167
p = 0.16418
NeighborhoodCollgCr -11,981.53000 (-27,428.29000, 3,465.23400)
t = -1.52028
p = 0.12861
NeighborhoodCrawfor 8,512.60900 (-8,324.81200, 25,350.03000)
t = 0.99091
p = 0.32185
NeighborhoodEdwards -22,691.78000*** (-38,931.64000, -6,451.92000)
t = -2.73864
p = 0.00623
NeighborhoodGilbert -20,599.96000** (-36,717.68000, -4,482.23400)
t = -2.50502
p = 0.01233
NeighborhoodGreens 15,846.29000 (-11,110.37000, 42,802.95000)
t = 1.15215
p = 0.24940
NeighborhoodGrnHill 129,888.20000*** (70,614.59000, 189,161.80000)
t = 4.29493
p = 0.00002
NeighborhoodIDOTRR -15,758.82000* (-33,892.39000, 2,374.76400)
t = -1.70329
p = 0.08868
NeighborhoodLandmrk 10,907.48000 (-49,333.66000, 71,148.62000)
t = 0.35488
p = 0.72272
NeighborhoodMeadowV 6,726.82500 (-11,876.50000, 25,330.15000)
t = 0.70871
p = 0.47859
NeighborhoodMitchel -17,512.17000** (-33,927.43000, -1,096.90400)
t = -2.09093
p = 0.03667
NeighborhoodNAmes -14,891.41000* (-30,827.83000, 1,045.00900)
t = -1.83144
p = 0.06719
NeighborhoodNoRidge 36,220.40000*** (18,513.70000, 53,927.10000)
t = 4.00926
p = 0.00007
NeighborhoodNPkVill 4,549.64400 (-17,547.67000, 26,646.96000)
t = 0.40354
p = 0.68660
NeighborhoodNridgHt 48,342.12000*** (33,032.92000, 63,651.32000)
t = 6.18901
p = 0.00000
NeighborhoodNWAmes -20,007.26000** (-36,442.10000, -3,572.42800)
t = -2.38600
p = 0.01713
NeighborhoodOldTown -17,281.65000** (-34,403.30000, -159.98980)
t = -1.97828
p = 0.04804
NeighborhoodSawyer -15,950.72000* (-32,518.69000, 617.25640)
t = -1.88694
p = 0.05932
NeighborhoodSawyerW -19,437.82000** (-35,604.88000, -3,270.75400)
t = -2.35648
p = 0.01855
NeighborhoodSomerst 8,589.75300 (-6,587.08300, 23,766.59000)
t = 1.10930
p = 0.26744
NeighborhoodStoneBr 53,246.43000*** (36,270.27000, 70,222.60000)
t = 6.14751
p = 0.00000
NeighborhoodSWISU -16,388.01000* (-35,511.23000, 2,735.21600)
t = -1.67963
p = 0.09319
NeighborhoodTimber -11.38951 (-17,011.67000, 16,988.89000)
t = -0.00131
p = 0.99896
NeighborhoodVeenker 3,344.23400 (-16,698.45000, 23,386.92000)
t = 0.32703
p = 0.74368
BaseLivArea 25.57887*** (19.13377, 32.02396)
t = 7.77857
p = 0.00000
Bldg.Type2fmCon -5,936.95600 (-16,252.66000, 4,378.75100)
t = -1.12801
p = 0.25946
Bldg.TypeDuplex -2,557.69000 (-15,094.53000, 9,979.15300)
t = -0.39986
p = 0.68931
Bldg.TypeTwnhs -45,079.30000*** (-56,046.87000, -34,111.73000)
t = -8.05591
p = 0.00000
Bldg.TypeTwnhsE -32,478.51000*** (-39,499.60000, -25,457.43000)
t = -9.06650
p = 0.00000
Year.Built 372.81780*** (248.85920, 496.77640)
t = 5.89478
p = 0.00000
Sale.ConditionAdjLand 3,798.87300 (-18,793.65000, 26,391.39000)
t = 0.32956
p = 0.74177
Sale.ConditionAlloca 1,605.97800 (-13,556.25000, 16,768.21000)
t = 0.20760
p = 0.83557
Sale.ConditionFamily 6,843.60600 (-5,023.50300, 18,710.71000)
t = 1.13029
p = 0.25850
Sale.ConditionNormal 6,388.35500** (937.03400, 11,839.68000)
t = 2.29686
p = 0.02174
Sale.ConditionPartial 23,969.80000*** (16,429.00000, 31,510.59000)
t = 6.23010
p = 0.00000
Overall.Cond 5,168.43100*** (3,703.83200, 6,633.03100)
t = 6.91652
p = 0.00000
Lot.Area 0.52338*** (0.31905, 0.72772)
t = 5.02035
p = 0.000001
Screen.Porch 59.51841*** (36.20029, 82.83654)
t = 5.00272
p = 0.000001
Garage.Area 21.44998*** (12.96982, 29.93013)
t = 4.95759
p = 0.000001
House.Style1.5Unf 10,496.06000 (-6,278.54600, 27,270.67000)
t = 1.22637
p = 0.22021
House.Style1Story 11,751.13000*** (4,880.87600, 18,621.38000)
t = 3.35239
p = 0.00082
House.Style2.5Fin -27,978.63000** (-55,129.02000, -828.24650)
t = -2.01975
p = 0.04355
House.Style2.5Unf -5,474.84900 (-21,321.87000, 10,372.17000)
t = -0.67713
p = 0.49841
House.Style2Story -7,494.88700** (-13,589.77000, -1,400.00100)
t = -2.41017
p = 0.01604
House.StyleSFoyer 13,090.54000** (2,591.15400, 23,589.93000)
t = 2.44367
p = 0.01463
House.StyleSLvl 3,742.10200 (-4,796.90700, 12,281.11000)
t = 0.85893
p = 0.39049
Lot.ShapeIR2 8,123.22900* (-802.94190, 17,049.40000)
t = 1.78366
p = 0.07464
Lot.ShapeIR3 -29,048.06000*** (-45,007.06000, -13,089.06000)
t = -3.56746
p = 0.00037
Lot.ShapeReg 1,528.94500 (-1,664.16800, 4,722.05800)
t = 0.93848
p = 0.34812
Fireplaces 5,202.51400*** (2,643.28700, 7,761.74000)
t = 3.98431
p = 0.00008
Bathrooms 5,996.61700*** (2,018.21700, 9,975.01700)
t = 2.95424
p = 0.00318
Lot.Frontage -84.44456* (-168.94340, 0.05431)
t = -1.95870
p = 0.05029
Year.Remod.Add 97.99938* (-2.04115, 198.03990)
t = 1.91997
p = 0.05501
Land.SlopeMod 11,074.44000*** (4,382.30600, 17,766.56000)
t = 3.24344
p = 0.00121
Land.SlopeSev -5,074.33200 (-23,705.70000, 13,557.03000)
t = -0.53380
p = 0.59354
Bedroom.AbvGr -3,945.51400*** (-6,436.26200, -1,454.76600)
t = -3.10472
p = 0.00194
TotRms.AbvGrd 2,363.26400*** (586.93740, 4,139.59100)
t = 2.60758
p = 0.00919
Kitchen.AbvGr -13,944.52000** (-25,444.97000, -2,444.08200)
t = -2.37650
p = 0.01758
FoundationCBlock -2,626.30700 (-8,581.08400, 3,328.47000)
t = -0.86443
p = 0.38746
FoundationPConc 4,503.15500 (-2,144.87500, 11,151.18000)
t = 1.32761
p = 0.18446
FoundationSlab 1,685.56400 (-12,309.34000, 15,680.47000)
t = 0.23606
p = 0.81342
FoundationStone 8,073.04400 (-12,024.60000, 28,170.69000)
t = 0.78730
p = 0.43121
FoundationWood -15,500.73000 (-45,764.57000, 14,763.11000)
t = -1.00387
p = 0.31557
X2nd.Flr.SF 12.70646** (2.11361, 23.29932)
t = 2.35104
p = 0.01882
Lot.ConfigCulDSac 8,242.40000** (1,526.96600, 14,957.83000)
t = 2.40562
p = 0.01624
Lot.ConfigFR2 -4,743.18900 (-13,107.54000, 3,621.16500)
t = -1.11144
p = 0.26652
Lot.ConfigFR3 -1,202.19600 (-18,367.77000, 15,963.38000)
t = -0.13727
p = 0.89084
Lot.ConfigInside 1,917.46000 (-1,705.27200, 5,540.19100)
t = 1.03738
p = 0.29969
X3Ssn.Porch 49.63295 (-10.06617, 109.33210)
t = 1.62948
p = 0.10337
Bsmt.Unf.SF 5.39633 (-1.24957, 12.04224)
t = 1.59145
p = 0.11167
Constant -960,652.70000*** (-1,231,564.00000, -689,741.90000)
t = -6.95005
p = 0.00000
Observations 2,051
R2 0.86704
Adjusted R2 0.86200
Residual Std. Error 29,322.53000 (df = 1975)
F Statistic 171.72850*** (df = 75; 1975)
Note: *p<0.1; **p<0.05; ***p<0.01

6.3.2 Backward Elimination:

  • Starting Point: Begins with all predictors in the model.
  • Procedure: In each step, the predictor that is least significant (i.e., has the highest p-value) and does not contribute significantly to the model fit is removed.
  • Stopping Point: Continues until a stopping rule is met, often when all remaining predictors are significant at a specified alpha level.
  • Advantages: Like forward selection, it’s more computationally efficient than best subset selection.
# Load necessary library
library(MASS)

# Fit the full model (with all predictors)
full_model <- lm(SalePrice ~ Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities + 
                 Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style + 
                 Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + 
                 Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea + 
                 Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath + 
                 Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr + 
                 TotRms.AbvGrd + Fireplaces + Garage.Type + 
                 Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch + 
                 X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition, 
                 data=train_data)

# Perform backward selection
backward_selected_model <- step(full_model, 
                               direction="backward",
                               trace=0)  # trace=1 provides a step-by-step output

# View the final model
stargazer::stargazer(backward_selected_model, type = "html", out = "backward_regression.html", title = "Backward Elimination Model", ci = TRUE, single.row = TRUE, no.space = FALSE, align = TRUE, digits = 5, font.size = "small", report = "vc*stp")
Backward Elimination Model
Dependent variable:
SalePrice
Lot.Frontage -84.42088* (-168.94220, 0.10045)
t = -1.95763
p = 0.05042
Lot.Area 0.52307*** (0.31859, 0.72755)
t = 5.01379
p = 0.000001
Lot.ShapeIR2 8,117.05100* (-812.18100, 17,046.28000)
t = 1.78169
p = 0.07496
Lot.ShapeIR3 -29,053.66000*** (-45,017.04000, -13,090.28000)
t = -3.56717
p = 0.00037
Lot.ShapeReg 1,530.75000 (-1,663.36000, 4,724.86000)
t = 0.93930
p = 0.34770
Lot.ConfigCulDSac 8,251.54900** (1,532.03200, 14,971.07000)
t = 2.40683
p = 0.01619
Lot.ConfigFR2 -4,731.23600 (-13,100.98000, 3,638.50300)
t = -1.10793
p = 0.26803
Lot.ConfigFR3 -1,200.73600 (-18,370.64000, 15,969.16000)
t = -0.13707
p = 0.89100
Lot.ConfigInside 1,915.98800 (-1,707.76700, 5,539.74300)
t = 1.03629
p = 0.30020
Land.SlopeMod 11,087.45000*** (4,388.77000, 17,786.12000)
t = 3.24407
p = 0.00120
Land.SlopeSev -5,028.89200 (-23,686.26000, 13,628.47000)
t = -0.52829
p = 0.59736
NeighborhoodBlueste 2,819.96000 (-23,763.19000, 29,403.11000)
t = 0.20791
p = 0.83532
NeighborhoodBrDale 8,823.58100 (-11,667.90000, 29,315.06000)
t = 0.84396
p = 0.39880
NeighborhoodBrkSide -11,468.41000 (-29,119.02000, 6,182.19700)
t = -1.27348
p = 0.20300
NeighborhoodClearCr -13,632.73000 (-32,820.92000, 5,555.46800)
t = -1.39250
p = 0.16393
NeighborhoodCollgCr -11,971.54000 (-27,423.42000, 3,480.33600)
t = -1.51851
p = 0.12905
NeighborhoodCrawfor 8,524.80200 (-8,318.54200, 25,368.14000)
t = 0.99198
p = 0.32133
NeighborhoodEdwards -22,681.15000*** (-38,926.42000, -6,435.87600)
t = -2.73644
p = 0.00627
NeighborhoodGilbert -20,581.13000** (-36,707.13000, -4,455.13000)
t = -2.50144
p = 0.01245
NeighborhoodGreens 15,836.51000 (-11,127.59000, 42,800.60000)
t = 1.15112
p = 0.24983
NeighborhoodGrnHill 130,061.10000*** (70,675.59000, 189,446.60000)
t = 4.29255
p = 0.00002
NeighborhoodIDOTRR -15,760.57000* (-33,898.73000, 2,377.58600)
t = -1.70305
p = 0.08872
NeighborhoodLandmrk 10,970.53000 (-49,298.42000, 71,239.48000)
t = 0.35676
p = 0.72131
NeighborhoodMeadowV 6,785.24800 (-11,858.04000, 25,428.54000)
t = 0.71333
p = 0.47573
NeighborhoodMitchel -17,505.57000** (-33,925.46000, -1,085.68400)
t = -2.08956
p = 0.03679
NeighborhoodNAmes -14,862.30000* (-30,812.96000, 1,088.35000)
t = -1.82623
p = 0.06797
NeighborhoodNoRidge 36,241.03000*** (18,525.26000, 53,956.79000)
t = 4.00949
p = 0.00007
NeighborhoodNPkVill 4,562.31700 (-17,541.94000, 26,666.57000)
t = 0.40454
p = 0.68587
NeighborhoodNridgHt 48,350.85000*** (33,036.85000, 63,664.84000)
t = 6.18819
p = 0.00000
NeighborhoodNWAmes -19,989.76000** (-36,432.31000, -3,547.22000)
t = -2.38280
p = 0.01728
NeighborhoodOldTown -17,286.82000** (-34,413.07000, -160.56650)
t = -1.97834
p = 0.04803
NeighborhoodSawyer -15,924.98000* (-32,504.81000, 654.83900)
t = -1.88255
p = 0.05991
NeighborhoodSawyerW -19,407.73000** (-35,589.62000, -3,225.83900)
t = -2.35068
p = 0.01884
NeighborhoodSomerst 8,604.93300 (-6,578.63100, 23,788.50000)
t = 1.11076
p = 0.26681
NeighborhoodStoneBr 53,257.01000*** (36,275.32000, 70,238.71000)
t = 6.14673
p = 0.00000
NeighborhoodSWISU -16,399.20000* (-35,528.48000, 2,730.08000)
t = -1.68024
p = 0.09307
NeighborhoodTimber -12.97331 (-17,017.55000, 16,991.60000)
t = -0.00150
p = 0.99881
NeighborhoodVeenker 3,409.32900 (-16,679.05000, 23,497.71000)
t = 0.33264
p = 0.73945
Bldg.Type2fmCon -5,963.46800 (-16,294.87000, 4,367.93800)
t = -1.13133
p = 0.25806
Bldg.TypeDuplex -2,540.91000 (-15,085.22000, 10,003.40000)
t = -0.39700
p = 0.69142
Bldg.TypeTwnhs -45,105.03000*** (-56,086.96000, -34,123.09000)
t = -8.04997
p = 0.00000
Bldg.TypeTwnhsE -32,507.15000*** (-39,552.44000, -25,461.85000)
t = -9.04331
p = 0.00000
House.Style1.5Unf 10,473.11000 (-6,311.75700, 27,257.97000)
t = 1.22294
p = 0.22150
House.Style1Story 11,722.86000*** (4,828.52700, 18,617.19000)
t = 3.33265
p = 0.00088
House.Style2.5Fin -27,978.77000** (-55,135.97000, -821.58150)
t = -2.01926
p = 0.04360
House.Style2.5Unf -5,404.63300 (-21,315.42000, 10,506.16000)
t = -0.66577
p = 0.50564
House.Style2Story -7,450.43500** (-13,608.96000, -1,291.90800)
t = -2.37112
p = 0.01783
House.StyleSFoyer 13,046.92000** (2,510.04300, 23,583.79000)
t = 2.42686
p = 0.01532
House.StyleSLvl 3,711.62500 (-4,850.44600, 12,273.70000)
t = 0.84964
p = 0.39564
Overall.Qual 13,476.23000*** (11,688.02000, 15,264.44000)
t = 14.77061
p = 0.00000
Overall.Cond 5,166.65000*** (3,701.26600, 6,632.03400)
t = 6.91044
p = 0.00000
Year.Built 373.42820*** (248.86100, 497.99550)
t = 5.87559
p = 0.00000
Year.Remod.Add 97.81441* (-2.31705, 197.94590)
t = 1.91461
p = 0.05569
FoundationCBlock -2,616.14400 (-8,575.75300, 3,343.46500)
t = -0.86038
p = 0.38969
FoundationPConc 4,500.01300 (-2,149.97000, 11,150.00000)
t = 1.32630
p = 0.18490
FoundationSlab 1,620.10900 (-12,437.13000, 15,677.35000)
t = 0.22589
p = 0.82132
FoundationStone 8,033.56600 (-12,084.04000, 28,151.17000)
t = 0.78267
p = 0.43392
FoundationWood -15,525.36000 (-45,800.65000, 14,749.93000)
t = -1.00508
p = 0.31499
Bsmt.Unf.SF 5.37722 (-1.28092, 12.03537)
t = 1.58290
p = 0.11361
BaseLivArea 25.56636*** (19.11498, 32.01775)
t = 7.76719
p = 0.00000
X2nd.Flr.SF 12.75196** (2.11889, 23.38504)
t = 2.35053
p = 0.01885
Gr.Liv.Area 41.53372*** (33.28478, 49.78267)
t = 9.86849
p = 0.00000
Full.Bath 6,028.74500*** (1,999.69500, 10,057.79000)
t = 2.93273
p = 0.00340
Half.Bath 2,826.39400 (-1,090.75800, 6,743.54700)
t = 1.41420
p = 0.15747
Bedroom.AbvGr -3,959.60400*** (-6,466.27700, -1,452.93000)
t = -3.09601
p = 0.00199
Kitchen.AbvGr -13,962.13000** (-25,470.64000, -2,453.61500)
t = -2.37783
p = 0.01751
TotRms.AbvGrd 2,363.35600*** (586.58330, 4,140.12900)
t = 2.60703
p = 0.00921
Fireplaces 5,213.14800*** (2,644.78400, 7,781.51300)
t = 3.97825
p = 0.00008
Garage.Area 21.43800*** (12.95246, 29.92354)
t = 4.95168
p = 0.000001
X3Ssn.Porch 49.54326 (-10.19677, 109.28330)
t = 1.62543
p = 0.10424
Screen.Porch 59.53482*** (36.20862, 82.86101)
t = 5.00236
p = 0.000001
Sale.ConditionAdjLand 3,777.04400 (-18,825.20000, 26,379.29000)
t = 0.32753
p = 0.74331
Sale.ConditionAlloca 1,632.42900 (-13,542.48000, 16,807.34000)
t = 0.21084
p = 0.83304
Sale.ConditionFamily 6,853.62800 (-5,018.08700, 18,725.34000)
t = 1.13150
p = 0.25799
Sale.ConditionNormal 6,389.47000** (936.73780, 11,842.20000)
t = 2.29667
p = 0.02175
Sale.ConditionPartial 23,977.05000*** (16,433.02000, 31,521.08000)
t = 6.22932
p = 0.00000
Constant -961,415.00000*** (-1,232,806.00000, -690,023.50000)
t = -6.94325
p = 0.00000
Observations 2,051
R2 0.86705
Adjusted R2 0.86193
Residual Std. Error 29,329.89000 (df = 1974)
F Statistic 169.38410*** (df = 76; 1974)
Note: *p<0.1; **p<0.05; ***p<0.01

6.3.3 Best Subset Selection:

  • Procedure: Considers all possible combinations of predictors and fits a model for each combination.
  • Selection Criteria: The “best” model is selected based on a criterion like RSS, Akaike information criterion (AIC), Bayesian information criterion (BIC), or adjusted R-squared.
  • Advantages: Can find the optimal model based on the selected criterion.
  • Disadvantages: Computationally intensive, especially as the number of predictors grows. With p predictors, there are 2^p possible models to evaluate, so it is often infeasible for datasets with many predictors. For this dataset, I cap the maximum number of variables in the model at 34, seven fewer than the full set of predictors; adjusting the nvmax argument in the code below changes this cap.
# Load necessary libraries
library(leaps)
library(tidyverse)

# Perform best subset selection
# best_subsets <- regsubsets(SalePrice ~ Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities + Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style + 
#                            Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea + Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath + 
#                            Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr + 
#        TotRms.AbvGrd + Fireplaces + Garage.Type + Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch + X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition, 
#                            data=train_data, nvmax=ncol(train_data)-7,really.big=T)

# # Extracting the summary
# best_subsets_summary <- summary(best_subsets)

# # Displaying the best model for each number of predictors based on RSS (or adjust for other criteria)
# data.frame(
#   predictors = 1:(ncol(train_set)-1),
#   adjr2 = best_subsets_summary$adjr2,
#   rss = best_subsets_summary$rss,
#   cp = best_subsets_summary$cp,
#   bic = best_subsets_summary$bic
# ) %>% 
#   arrange(desc(adjr2)) %>% 
#   head(1)

# # Identify the best model size based on highest adjusted R-squared
# best_model_size <- which.min(best_subsets_summary$bic)

# # Extract the variables included in the best model
# best_model_vars <- names(which(summary_results$which[best_model_size, ]))

# # Construct the formula
# best_formula <- as.formula(paste("y ~", paste(best_model_vars[-1], collapse = " + ")))

# print(best_formula)
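Since the full regsubsets() call is commented out above for run-time reasons, here is a minimal working sketch of the same workflow on mtcars (standing in for train_data), selecting the model size with the lowest BIC:

```r
library(leaps)

# Exhaustively search all subsets of up to 5 predictors
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
bs_summary   <- summary(best_subsets)

# Pick the subset size with the lowest BIC and inspect its coefficients
best_size <- which.min(bs_summary$bic)
coef(best_subsets, best_size)
```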

In practice, the choice of method depends on:

  • The number of predictors (with many predictors, best subset selection becomes computationally infeasible).
  • The primary goal (prediction accuracy vs. interpretability).
  • Computational resources.

While best subset selection can find the optimal model for a given criterion, forward and backward selection are often preferred for their computational efficiency and for producing simpler, more interpretable models.

We need to consider a selection criterion to compare the best models from forward, backward, and best subset selection methods. The AIC (Akaike Information Criterion) is a commonly used criterion that penalizes the addition of unnecessary predictors to a model. A lower AIC suggests a better-fitting model. Another criterion you could use is the adjusted R^2, which adjusts R^2 based on the number of predictors in the model. A higher adjusted R^2 suggests a better-fitting model.
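For reference, for a model with k estimated parameters and maximized likelihood L, and for a Gaussian linear model with n observations and residual sum of squares RSS:

\begin{align} \mathrm{AIC} &= 2k - 2\ln(\hat{L}) = n\ln\left(\frac{\mathrm{RSS}}{n}\right) + 2k + C \end{align}

where C is a constant that is the same for all models fitted to the same data, so only differences in AIC matter.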

Let’s extract the best models from all the procedures.

library(MASS)
library(leaps)

# Forward Selection
forward_aic <- AIC(forward_selected_model)

# Backward Selection
backward_aic <- AIC(backward_selected_model)

# Best Subset Selection

# best_model_size <- which.min(best_subset_summary$cp) # Replace cp with bic or adjr2 as needed
# best_formula <- as.formula(best_subsets$call[[2]][[2]][[best_model_size]])
# best_subset_selected_model <- lm(best_formula, data=train_data)
# best_subset_aic <- AIC(best_subset_selected_model)

# Comparing AIC values to identify the best model
# aic_values <- c(Forward=forward_aic, Backward=backward_aic, BestSubset=best_subset_aic)
aic_values <- c(Forward=forward_aic, Backward=backward_aic)
best_method <- names(which.min(aic_values))

The best model selection method based on AIC is Forward.

7 Prediction Over Test

# Predictions using the forward_selected_model
predicted_values_forward <- predict(forward_selected_model, newdata = test_data)

# Compare predicted values to actual SalePrice values in the test set
actual_values_test <- test_data$SalePrice
comparison <- data.frame(Actual = actual_values_test, Predicted = predicted_values_forward)

# Calculate Mean Squared Error
mse_forward <- mean((predicted_values_forward - actual_values_test)^2)
rmse <- sqrt(mse_forward)

# Display the MSE and first few rows of the comparison

head(comparison) %>% kbl(fullwidth=F)
     Actual  Predicted
3    159000   182796.9
6    190000   194751.0
15   201800   216285.2
19   100000   123190.2
22   235876   247688.0
26   172000   159412.5

The Root Mean Squared Error for the Forward Selection Model is 29398.9693.
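This figure is the usual root mean squared error computed over the n test observations:

\begin{align} \mathrm{RMSE} &= \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \end{align}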

7.1 Visualizations

library(ggplot2)

# Plotting actual vs. predicted values
p <- ggplot(comparison, aes(x=Actual, y=Predicted)) +
  geom_point(aes(color = abs(Actual - Predicted)), size = 3, alpha = 0.7) +  # Color points by residual magnitude
  geom_abline(intercept=0, slope=1, linetype="dashed", color="red") +        # Line of perfect prediction
  scale_color_viridis_c(option = "E", direction = -1) + 
  labs(
    title = "Actual vs Predicted SalePrice",
    x = "Actual SalePrice",
    y = "Predicted SalePrice",
    color = "Residual Magnitude"
  ) +
  theme_minimal()
p


8 Equations

For class explanation only; not part of the analysis.

8.1 Causal

\begin{align} \ln(y) &= \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2} \implies y = e^{\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}}\\ \sqrt{y} &= \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2} \implies y = (\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})^{2}\\ \sqrt[3]{y} &= \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2} \implies y = (\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})^{3}\\ cnt &= 161.807 + 85.5765\,temp + 314.3430\,atemp - 275.1803\,hum + 43.000\,windspeed\\ \log(cnt) &= 1.9205 + 0.043330139\,temp + 1.287265532\,atemp - 0.979846625\,hum + \ldots\,windspeed \end{align}